
training data distribution


Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization

Neural Information Processing Systems

Can we modify the training data distribution to encourage the underlying optimization method toward finding solutions with superior generalization performance on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we rigorously prove that SAM learns different features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias compared to GD. We also show that examples containing features that are learned early are separable from the rest based on the model's output.


Reviews: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Neural Information Processing Systems

Summary: This paper proposes a new algorithm that helps stabilize off-policy Q-learning. The idea is to introduce approximate Bellman updates based on actions sampled only from the support of the training data distribution. The paper shows that the main source of instability is bootstrapping error: the bootstrapping process may use actions that do not lie in the training data distribution. This work shows a way to mitigate this issue.


Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting

Hawley, Scott H.

arXiv.org Artificial Intelligence

Recent years have witnessed significant progress in generative models for music, featuring diverse architectures that balance output quality, diversity, speed, and user control. This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images. To enhance note generation in specified areas, masked regions can be "repainted" with extra noise. The non-latent HDiT's linear scaling with pixel count allows efficient generation in pixel space, providing intuitive and interpretable controls such as masking throughout the network and removing the need to operate in compressed latent spaces such as those provided by pretrained autoencoders. We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help increase note density, yielding musical structures closely matching user specifications such as rising, falling, or diverging melody and/or accompaniment, even when these lie outside the typical training data distribution. We achieve performance on par with prior results while operating at longer context windows, with no autoencoder, and can enable complex geometries for inpainting masks, increasing the options for machine-assisted composers to control the generated music.


Make the Most of Your Data: Changing the Training Data Distribution to Improve In-distribution Generalization Performance

Nguyen, Dang, Haddad, Paymon, Gan, Eric, Mirzasoleiman, Baharan

arXiv.org Artificial Intelligence

Can we modify the training data distribution to encourage the underlying optimization method toward finding solutions with superior generalization performance on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we prove that SAM learns easy and difficult features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias compared to GD. Based on this observation, we propose USEFUL, an algorithm that clusters examples based on the network output early in training and upsamples examples with no easy features to alleviate the pitfalls of the simplicity bias. We show empirically that modifying the training data distribution in this way effectively improves the generalization performance on the original data distribution when training with (S)GD by mimicking the training dynamics of SAM. Notably, we demonstrate that our method can be combined with SAM and existing data augmentation strategies to achieve, to the best of our knowledge, state-of-the-art performance for training ResNet18 on CIFAR10, STL10, CINIC10, Tiny-ImageNet; ResNet34 on CIFAR100; and VGG19 and DenseNet121 on CIFAR10.
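As a rough illustration of the idea (not the authors' implementation), the clustering-and-upsampling step can be sketched as a 1-D two-means split on per-example early-training loss, duplicating the high-loss cluster; the function name and the use of loss as the separating signal are assumptions:

```python
import numpy as np

def useful_upsample(losses, X, y):
    """Hypothetical sketch of the USEFUL idea: split examples into two
    clusters by early-training loss (1-D 2-means) and duplicate the
    high-loss cluster (examples with no easy features)."""
    # Initialize the two centroids at the loss extremes.
    c = np.array([losses.min(), losses.max()], dtype=float)
    for _ in range(50):
        # Assign each example to its nearest centroid.
        assign = np.abs(losses[:, None] - c[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                c[k] = losses[assign == k].mean()
    hard = assign == c.argmax()  # high-loss cluster: no easy features
    # Upsample the hard cluster by duplicating its examples once.
    X_aug = np.concatenate([X, X[hard]])
    y_aug = np.concatenate([y, y[hard]])
    return X_aug, y_aug, hard
```

Training (S)GD on the reweighted `(X_aug, y_aug)` would then stand in for SAM's more uniform feature learning.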


On the Limitation of Diffusion Models for Synthesizing Training Datasets

Yamaguchi, Shin'ya, Fukuda, Takuma

arXiv.org Artificial Intelligence

Synthetic samples from diffusion models are promising for use in training discriminative models as replications of real training datasets. However, we found that the synthetic datasets degrade classification performance compared with real datasets, even when using state-of-the-art diffusion models. This means that modern diffusion models do not perfectly represent the data distribution for the purpose of replicating datasets for training discriminative tasks. This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the diffusion and reverse process. By varying the time step at which the reverse process starts in the reconstruction, we can control the trade-off between the information in the original real data and the information added by diffusion models. Through assessing the reconstructed samples and trained models, we found that the synthetic data become concentrated in modes of the training data distribution as the reverse step increases, and thus fail to cover the outer edges of the distribution. Our findings imply that modern diffusion models are insufficient to replicate the training data distribution perfectly, and there is room for improving generative modeling in the replication of training datasets.
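The reconstruction experiment can be illustrated with a toy Gaussian model in place of a trained diffusion network (everything here, including the oracle posterior-mean denoiser and the DDPM-style noise schedule, is an illustrative assumption): diffusing a real sample to a later step `t` before denoising recovers less of the original information:

```python
import numpy as np

def reconstruct_from_t(x0, t, T=1000, seed=0):
    """Toy sketch (not the paper's models): diffuse a real sample x0 to
    step t of a DDPM-style forward process, then reconstruct with the
    posterior mean under an assumed x0 ~ N(0, I) data model."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
    abar = np.cumprod(1.0 - betas)[t]           # cumulative signal level
    # Forward process: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * noise
    xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * rng.standard_normal(x0.shape)
    # Oracle denoiser for x0 ~ N(0, I): E[x0 | xt] = sqrt(abar) * xt
    return np.sqrt(abar) * xt
```

In this toy model the expected reconstruction error is `1 - abar(t)`, so starting the reverse process later keeps less of the original sample, mirroring the trade-off the paper varies.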


Privacy Threats in Stable Diffusion Models

Cilloni, Thomas, Fleming, Charles, Walter, Charles

arXiv.org Artificial Intelligence

This paper introduces a novel approach to membership inference attacks (MIA) targeting stable diffusion computer vision models, specifically focusing on the highly sophisticated Stable Diffusion V2 by StabilityAI. MIAs aim to extract sensitive information about a model's training data, posing significant privacy concerns. Despite its advancements in image synthesis, our research reveals privacy vulnerabilities in the stable diffusion models' outputs. Exploiting this information, we devise a black-box MIA that only needs to query the victim model repeatedly. Our methodology involves observing the output of a stable diffusion model at different generative epochs and training a classification model to distinguish whether a series of intermediates originated from a training sample. We propose numerous ways to measure the membership features and discuss what works best. The attack's efficacy is assessed using the ROC AUC method, demonstrating a 60% success rate in inferring membership information. This paper contributes to the growing body of research on privacy and security in machine learning, highlighting the need for robust defenses against MIAs. Our findings prompt a reevaluation of the privacy implications of stable diffusion models, urging practitioners and developers to implement enhanced security measures to safeguard against such attacks.
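The final scoring step, turning per-sample membership scores into an ROC AUC, can be sketched as follows; the choice of score (e.g., similarity between the model's intermediate outputs and the query image) is an assumption, and the AUC is computed with the rank (Mann-Whitney) statistic rather than an explicit ROC sweep:

```python
import numpy as np

def mia_auc(member_scores, nonmember_scores):
    """Sketch of evaluating a black-box MIA: given membership scores for
    known members and non-members, compute the ROC AUC via ranks."""
    scores = np.concatenate([member_scores, nonmember_scores])
    labels = np.concatenate([np.ones(len(member_scores)),
                             np.zeros(len(nonmember_scores))])
    # Rank every score from 1 (smallest) to N (largest).
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # AUC = (sum of member ranks - n_m(n_m+1)/2) / (n_m * n_n)
    n_m, n_n = len(member_scores), len(nonmember_scores)
    return (ranks[labels == 1].sum() - n_m * (n_m + 1) / 2) / (n_m * n_n)
```

An AUC of 0.5 means the attack is no better than chance; the paper's reported figure corresponds to roughly 0.6 on this scale.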


Automated patent extraction powers generative modeling in focused chemical spaces

Subramanian, Akshay, Greenman, Kevin P., Gervaix, Alexis, Yang, Tzuhsiung, Gómez-Bombarelli, Rafael

arXiv.org Artificial Intelligence

Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in the field of data-driven molecular design. Because patents are filed seeking to protect specific uses, molecules in patents can be considered to be weakly labeled into application classes. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we train domain-specific generative models using patent data sources by developing an automated pipeline to go from USPTO patent digital files to the generation of novel candidates with minimal human intervention. We test the approach on two in-class extracted datasets, one in organic electronics and another in tyrosine kinase inhibitors. We then evaluate generative models trained on these in-class datasets on two categories of tasks (distribution learning and property optimization), identify strengths and limitations, and suggest possible explanations and remedies that could be used to overcome these limitations in practice.


Semi-supervised counterfactual explanations

Sajja, Shravan Kumar, Mukherjee, Sumanta, Dwivedi, Satyam

arXiv.org Artificial Intelligence

Counterfactual explanations for machine learning models are used to find minimal interventions to the feature values such that the model changes the prediction to a different output or a target output. A valid counterfactual explanation should have likely feature values. Here, we address the challenge of generating counterfactual explanations that lie in the same data distribution as the training data and, more importantly, belong to the target class distribution. This requirement has been addressed through the incorporation of an auto-encoder reconstruction loss in the counterfactual search process. Connecting the output behavior of the classifier to the latent space of the auto-encoder has further improved the speed of the counterfactual search process and the interpretability of the resulting counterfactual explanations. Continuing this line of research, we show further improvement in the interpretability of counterfactual explanations when the auto-encoder is trained in a semi-supervised fashion with class-tagged input data. We empirically evaluate our approach on several datasets and show considerable improvement in terms of several metrics.
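A minimal sketch of a counterfactual search with an added reconstruction penalty, assuming a linear classifier and an orthonormal linear projection `P` as a stand-in for the (semi-supervised) autoencoder manifold; all names, the hinge target loss, and the default weights are illustrative, not the paper's formulation:

```python
import numpy as np

def counterfactual(x, w, b, P, steps=200, lr=0.1, lam=0.1, mu=0.1):
    """Gradient search for a counterfactual x_cf that pushes a linear
    classifier w.x + b across its decision boundary while staying close
    to the query x and to the subspace spanned by the columns of P
    (a stand-in for the autoencoder's reconstruction manifold)."""
    x_cf = x.copy()
    for _ in range(steps):
        margin = w @ x_cf + b
        # Hinge-style pull toward the positive class while margin < 1.
        grad = np.where(margin < 1.0, -w, 0.0)
        # Proximity term: stay near the original query.
        grad = grad + lam * (x_cf - x)
        # Reconstruction term: stay near the data manifold P P^T x_cf.
        recon = P @ (P.T @ x_cf)
        grad = grad + mu * (x_cf - recon)
        x_cf = x_cf - lr * grad
    return x_cf
```

In the paper's setting the linear projection would be replaced by a trained semi-supervised autoencoder and its class-conditional latent structure.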


Generating Accurate Virtual Examples For Lifelong Machine Learning

Mahfuz, Sazia

arXiv.org Artificial Intelligence

Lifelong machine learning (LML) is an area of machine learning research concerned with the human-like persistent and cumulative nature of learning. An LML system's objective is to consolidate new information into an existing machine learning model without catastrophically disrupting the prior information. Our research addresses this LML retention problem by creating a knowledge consolidation network through task rehearsal without retaining the prior task's training examples. We discovered that a trained Restricted Boltzmann Machine can be used to generate accurate virtual examples by reconstructing a uniform random set of examples given to the trained model. We also defined a measure for comparing the probability distributions of two datasets given to a trained network model, based on their reconstruction mean square errors.
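A minimal sketch of the virtual-example idea, using a mean-field up-down pass through an RBM; the weights here are arbitrary placeholders (in the thesis the RBM would be trained on the prior task's data), and the per-example reconstruction error is the quantity the distribution-comparison measure would build on:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_virtual_examples(W, b_v, b_h, n, seed=0):
    """Sketch: feed n uniform random probe vectors through one
    deterministic visible -> hidden -> visible pass of an RBM and
    return the reconstructions as virtual examples, along with each
    probe's reconstruction mean square error."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(size=(n, len(b_v)))      # uniform random probe examples
    h = sigmoid(v @ W + b_h)                 # visible -> hidden mean-field
    v_rec = sigmoid(h @ W.T + b_v)           # hidden -> visible reconstruction
    mse = ((v_rec - v) ** 2).mean(axis=1)    # per-example reconstruction error
    return v_rec, mse
```

With a trained RBM, the reconstructions concentrate near the learned data distribution, which is what makes them usable as rehearsal examples for the prior task.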


Avoid These Data Pitfalls When Moving Machine Learning Applications Into Production

#artificialintelligence

How often have you heard, "The Machine Learning application worked well in the lab, but it failed in the field. It is not the fault of the Machine Learning model!"? This blog is not yet another blog article (YABA) on DataOps, DevOps, MLOps, or CloudOps. I do not mean to imply xOps is not essential. For example, MLOps is both strategic and tactical: it promises to transform the "ad-hoc" delivery of Machine Learning applications into software engineering best practices.

We know the symptoms: most machine-learning models trained in the lab perform poorly on real-world data [1, 2, 3, 4]. Machine Learning created profits in 2020 and will continue to increase profits in the future. However, many problems hold back the progress and success of rolling Machine Learning applications out to production. I focus on what I see as the most significant cause: the quality and quantity of the input data to Machine Learning models [1, 4]. We realized the quantity of high-quality data was the bottleneck in predictive accuracy when we started showing near, or above, human-level performance on structured data, imagery, game playing, and natural language tasks.

How many times do we look at the conceptualization of the Machine Learning application lifecycle and realize that the Machine Learning model is not at the beginning (Figure 2)? We can research and improve the tools of the Machine Learning application lifecycle, but that only lowers the cost of deployment. Arguably, the choice of Machine Learning model is not a critical part of deploying a Machine Learning application: we have a "good enough" process, or pipeline, to choose and change the Machine Learning model, given a training input dataset. However, when achieving State-of-the-Art (SOTA) results, the input data seems to have the most significant impact on the output predictive data (Figure 2). We seem to know the cause: garbage input data results in garbage output predictions.
New data input to a trained Machine Learning model determines the accuracy of the output. We divide Machine Learning input data into four arbitrary categories, defined by the Machine Learning application's output accuracy. GPT-3 is an example [6]: it was trained with an enormous amount of data and is frozen in time as a transformer that you access through an API.

Concept Drift is a change in what to predict, for example, a change in the definition of "what is a spammer." We do not cover Concept Drift here; I do not think of it as a problem but rather as a change in the solution's scope. An example of Case 2, Data Drift, is that Case 1, "It works!", is a temporal phenomenon.